PhiHER2: phenotype-informed weakly supervised model for HER2 status prediction from pathological images

Abstract

Motivation: Human epidermal growth factor receptor 2 (HER2) status identification enables physicians to assess prognostic risk and determine the treatment schedule for patients. In clinical practice, pathological slides serve as the gold standard, offering morphological information on cellular structure and tumoral regions. Computational analysis of pathological images has the potential to discover morphological patterns associated with HER2 molecular targets and achieve precise status prediction. However, pathological images are typically of very high resolution, and HER2 expression in breast cancer (BC) images often exhibits intratumoral heterogeneity.

Results: We present a phenotype-informed weakly supervised multiple instance learning architecture (PhiHER2) for the prediction of HER2 status from pathological images of BC. Specifically, a hierarchical prototype clustering module is designed to identify representative phenotypes across whole slide images (WSIs). These phenotype embeddings are then integrated into a cross-attention module, enhancing feature interaction and aggregation on instances. This yields a phenotype-based feature space that leverages the intratumoral morphological heterogeneity for HER2 status prediction. Extensive results demonstrate that PhiHER2 captures a better WSI-level representation through typical phenotype guidance and significantly outperforms existing methods on real-world datasets. Additionally, interpretability analyses of both phenotypes and WSIs provide explicit insights into the heterogeneity of morphological patterns associated with molecular HER2 status.

Availability and implementation: Our model is available at https://github.com/lyotvincent/PhiHER2


Introduction
HER2 (human epidermal growth factor receptor 2) is a crucial protein in the regulation of cell growth and division. Anomalies in HER2 expression are intricately related to the occurrence, development, and progression of various cancers, particularly breast cancer (BC) (English et al. 2013). Therefore, precise determination of HER2 status helps clinicians craft individualized treatment plans, improve treatment efficacy, and ultimately improve patient survival rates. Immunohistochemistry (IHC) and fluorescence in situ hybridization (FISH) are the two principal diagnostic techniques used in clinical settings (Wolff et al. 2018). Nonetheless, detecting HER2 expression remains a complex task. The diagnostic testing process requires a substantial expenditure of human resources and materials. Notably, specialized technicians are indispensable for the handling and staining of samples, which contributes to the high overall cost of testing. Furthermore, the level of HER2 expression from IHC testing is often qualitative or semi-quantitative (Chae et al. 2017), which introduces an element of subjectivity and uncertainty. This can affect the test results and make it difficult to maintain consistency among pathologists. Therefore, an efficient, accurate, and fully quantitative method is highly desirable for HER2 status detection in BC.
Histopathology serves as the gold standard for clinical diagnosis. H&E pathological slides provide morphological information, accompanied by spatial organization. With the advent of deep learning and artificial intelligence, computational pathology (CPath) has advanced significantly through digital whole slide image (WSI) analysis (Song et al. 2023). It has shown remarkable progress in various tasks, including nuclei and tissue identification, tumor detection, cancer grading, subtyping, survival prediction, treatment response prediction, and prediction of molecular targets, among others (Laleh et al. 2022, Huang et al. 2023). Pathological WSI analysis is crucial, but it also poses fundamental challenges. WSIs are high-resolution and multi-scale, with sizes that can reach billions of pixels. This makes it infeasible for standard CNNs to identify regions of interest (ROIs) and aggregate features. It is also essential to uncover morphological phenotypes and subvisual characteristics, especially in tasks involving molecular-omics targets. In previous studies, Ding et al. (2022) proposed a graph neural network model to predict the cross-level molecular profiles of genetic mutations, copy number alterations, and functional protein expressions in colon cancer WSIs. They provided a graph structure interpretation scheme and demonstrated a wide range of molecular-histopathological associations. Lazard et al. (2022) introduced a deep learning approach to predict homologous recombination deficiency from BC WSIs and analyzed the phenotypic patterns relevant to genotype-phenotype relationships. Binder et al. (2021) presented an explainable machine learning framework to profile molecular and clinical features from BC histology, which facilitated the assessment of the link between morphological and molecular properties. The MOMA framework (Tsai et al. 2023) was proposed to predict the genomics and proteomics status of cancer samples from WSIs. It identified interpretable patterns predictive of genetic profiles and multiomics aberrations. The DEMoS model (Wang et al. 2022) was designed to predict molecular subtypes of microsatellite instability (MSI) in gastric cancer, and a transformer model was developed for MSI status prediction in colorectal cancer WSIs. The latter model was shown to learn morphological concepts associated with MSI-high prediction (Wagner et al. 2023).
Specifically, for automated prediction of HER2 status from pathological images, Farahmand et al. (2022) proposed an inception-v4 model on the basis of manually annotated tumor ROIs, while Hossain et al. (2023) developed an automated tumor ROI detection method. Both achieved HER2 status prediction, but they rely on a fully supervised learning strategy, which incurs a tremendous computational cost over billions of patches compared with weakly supervised learning methods. Pisula et al. (2023) implemented a weakly supervised multiple instance learning (MIL) method for IHC-stained tissue slides, and an attention-based MIL approach, ReceptorNet (Naik et al. 2020), was applied to hormonal receptor and HER2 status prediction on H&E WSIs. A graph neural network architecture was also introduced for HER2 status prediction from WSIs of BC tissue (Lu et al. 2022). However, despite achieving remarkable prediction accuracy, these approaches overlooked the intratumoral heterogeneity of HER2 expression (Seol et al. 2012). HER2 intratumoral heterogeneity manifests as the coexistence of positive and negative cells in tumor sections, which could potentially result in resistance to HER2-targeted therapies (Swain et al. 2023). This prompts us to explore how to discover morphologically representative and heterogeneous content from pathological images. Such morphological phenotypes provide further hints for the precise prediction of HER2 status, which we refer to as phenotype-informed prediction. Recently, a vocabulary-based MIL paradigm was introduced for the detection of lymph node metastasis, in which a prototype discovery module was designed to capture pathology patterns (Yu et al. 2023). Prototypes have also been shown to be biologically interpretable (Vu et al. 2023). This gives us the potential to identify morphological phenotypes and reveal characteristics of intratumoral heterogeneity for HER2.
In this work, we present a phenotype-informed multiple instance learning architecture (PhiHER2) for automated HER2 status prediction of BC from H&E-stained WSIs. The PhiHER2 model leverages weakly supervised learning and is applicable to high-resolution WSIs, enabling the automatic identification of key regions without the requirement for manual ROI annotation. A hierarchical prototype clustering module is first designed to discover phenotypes that reveal the morphological patterns exhibited across pathological WSIs. Phenotype embeddings are then introduced into the MIL framework for feature aggregation with the cross-attention module. This enhances interaction among instances and captures a phenotype-based feature space that reflects intratumoral heterogeneity for HER2 status prediction. An overlapping heatmap visualization approach is developed for the interpretability analysis of WSIs. It uncovers the associations of morphological patterns with molecular HER2 status. The experimental results on two real-world datasets highlight that our proposed model effectively accounts for instance aggregation, captures WSI-level representations, and significantly improves model performance.

Materials and methods
Figure 1a and b illustrate the overall flowchart of our phenotype-informed framework. The workflow is based on a weakly supervised MIL approach and includes WSI processing, feature extraction, hierarchical prototype clustering, the cross-attention module with the HER2 status prediction classifier, and the dual instance sampling strategy. In the following sections, the problem is first formulated, and then our components are described in detail.

Problem formulation
Consider a WSI dataset $X = \{X_i\}_{i=1}^{N}$ containing $N$ slides. Each slide $X_i$ is tiled into patches $\{t^i_j\}_{j=1}^{n_i}$, where $i$ and $j$ denote the index of the slide and the patch, respectively, and $n_i$ is the number of patches of $X_i$. In the concept of MIL, each patch $t^i_j$ is known as an instance and the set of patches $\{t^i_j\}_{j=1}^{n_i}$ is a bag. For a classification task, only the bag-level labels $Y = \{Y_i\}_{i=1}^{N}$ are known, and the corresponding instance-level labels $\{y^i_j\}_{j=1}^{n_i}$ for patches are unknown. Our goal is to learn a MIL model $F(\cdot)$ that predicts the bag-level category $\hat{Y}_i = F(X_i)$ from the bag $X_i$. $F(\cdot)$ can be further decomposed into three parts, the transformation function $T(\cdot)$ on instances, the aggregation function $S(\cdot)$, and the bag-level classifier head $R(\cdot)$:

$$F(X_i) = R\big(S\big(\{T(t^i_j)\}_{j=1}^{n_i}\big)\big).$$

The function $T(\cdot)$ transforms each instance $t^i_j$ into a feature vector $p^i_j$ with dimensionality $d_1$, and $S(\cdot)$ aggregates all instance embeddings $P_i = \{p^i_j\}_{j=1}^{n_i}$ into a bag representation vector $s_i$, which is used for predicting $\hat{Y}_i$ by the classifier function $R(\cdot)$.
Typically, $T(\cdot)$ and $R(\cdot)$ can be parameterized by neural networks. The key challenge in MIL lies in how $S(\cdot)$ handles the relationships between instances in one bag, so that the model can learn an efficient representation and make accurate predictions. Here we introduce a new function $\phi(\cdot)$ into the feature aggregation function $S(\cdot)$ for guidance. $\phi(\cdot)$ is capable of discovering $L$ phenotypes $\{p_k\}_{k=1}^{L}$ from the instance embeddings $\{P_i\}_{i=1}^{N}$ across WSIs.
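As a minimal illustration of the decomposition described above, the following sketch composes a transformation T, an aggregation S, and a classifier head R into a bag-level predictor F. The concrete choices here (a two-number patch summary for T, mean pooling for S, a logistic head for R, and all function names) are assumptions made for illustration and are not the components used in PhiHER2.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]

def make_mil_model(
    T: Callable[[Sequence[float]], Vector],
    S: Callable[[List[Vector]], Vector],
    R: Callable[[Vector], float],
) -> Callable[[List[Sequence[float]]], float]:
    """Compose F = R(S({T(t_j)})) for one bag of instances."""
    def F(bag):
        instance_embeddings = [T(t) for t in bag]   # P_i = {p_j}
        bag_vector = S(instance_embeddings)         # s_i
        return R(bag_vector)                        # predicted probability
    return F

# Toy stand-ins (assumptions for illustration only).
def toy_T(patch):
    # Embed a patch (flat list of pixel values) as [mean, max].
    return [sum(patch) / len(patch), max(patch)]

def mean_pool(embeddings):
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]

def logistic_head(s, weights=(1.0, 1.0), bias=-1.0):
    z = sum(w * x for w, x in zip(weights, s)) + bias
    return 1.0 / (1.0 + math.exp(-z))

F = make_mil_model(toy_T, mean_pool, logistic_head)
bag = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # one bag with two "patches"
prob = F(bag)
```

Only the bag-level output is supervised in this setting; the per-instance embeddings never see a label of their own.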

Whole slide images processing
The objective of processing WSIs is to filter out background areas that are not tissue, and to tile WSIs containing billions of pixels into small patches for MIL. Here, the WSI processing method in CLAM (Lu et al. 2021) is utilized for tissue detection. The non-overlapping sliding window tiling process is then executed within the detected region contours at the highest magnification, with each patch sized at 256 × 256 pixels (Fig. 1a). The number of patches obtained from WSIs can range from thousands to hundreds of thousands. Figure 1c and d illustrate the distribution of the number of patches.
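The non-overlapping tiling step can be sketched as follows. This is a simplified illustration rather than the CLAM implementation: tissue detection is replaced by a hypothetical `is_tissue` callback, and the region is treated as a plain rectangle instead of a detected contour.

```python
from typing import Callable, List, Tuple

def tile_coordinates(
    width: int,
    height: int,
    patch_size: int = 256,
    is_tissue: Callable[[int, int, int], bool] = lambda x, y, size: True,
) -> List[Tuple[int, int]]:
    """Return top-left (x, y) coordinates of non-overlapping patches.

    Patches that would extend beyond the region are dropped, and
    `is_tissue` (a hypothetical callback) decides whether a patch is kept.
    """
    coords = []
    for y in range(0, height - patch_size + 1, patch_size):
        for x in range(0, width - patch_size + 1, patch_size):
            if is_tissue(x, y, patch_size):
                coords.append((x, y))
    return coords

# Toy example: a 1024 x 600 region in which only the left half is "tissue".
coords = tile_coordinates(1024, 600, is_tissue=lambda x, y, s: x + s <= 512)
```

With a stride equal to the patch size, the number of patches grows with the tissue area, which is why bags can differ in size by orders of magnitude.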

Hierarchical prototype clustering
The prototype clustering is the function $\phi(\cdot)$ that we introduce into the typical MIL framework. It is designed to minimize noise and capture representative patterns that reveal the underlying tissue heterogeneity. This helps to guide the feature aggregation $S(\cdot)$ and obtain a WSI-level representation for the accurate prediction of HER2 status. Considering the number of instances, we develop a two-stage hierarchical prototype clustering module based on the unsupervised Affinity Propagation (AP) clustering algorithm (Frey and Dueck 2007). In this way, our module is insensitive to the initial values of the data and capable of identifying the optimal number of clusters when processing WSIs with a diverse range of instances. The hierarchical prototype clustering module is a stacked structure of multiple AP clusterers. Abstractly, taking $\{P_i\}_{i=1}^{N}$ as input, the first-stage AP cluster $M^{ap}_i$ is iteratively conducted on each bag with instance features $P_i$, ultimately yielding the cluster centers $\{p^i_m\}_{m=1}^{c_i}$, where $m$ is the index. $\{p^i_m\}_{m=1}^{c_i}$ is a subset of the instance features $P_i$; $c_i$ represents the number of cluster centers derived from the bag features $P_i$, and $c_i$ is significantly smaller than the number of instances $n_i$. Subsequently, the second-stage AP cluster $M^{ap}_a$ is applied to the cluster centers $C_N$ across WSIs. $C_N$ is the union of $\{p^i_m\}_{m=1}^{c_i}$ from the first-stage AP cluster, and $\{p_k\}_{k=1}^{L} \in \mathbb{R}^{L \times d_1}$ denotes the final centers after clustering on $C_N$. In this way, a collection of $L$ unique phenotypes is discovered from WSIs, revealing diverse morphological patterns.
In the implementation of the AP clustering algorithm, we apply a power operation to the negative normalized Euclidean distance (Yu et al. 2023) to measure the similarity between data points in both stages. We set the damping factor to 0.5 to avoid numerical oscillations during the iteration process, as mentioned in Frey and Dueck (2007). It is noteworthy that the hierarchical prototype clustering module is performed on the training data only. When inferring on test data, the pre-clustered prototype representation is loaded and used for guidance.
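The two-stage structure can be sketched independently of the clustering algorithm itself. In the example below, `select_exemplars` is a deliberately simple greedy stand-in (an assumption for illustration): like AP, it returns exemplars that are actual data points and does not need the number of clusters in advance, but it is not Affinity Propagation, and its `radius` parameter is hypothetical. In practice it would be replaced by an AP implementation with a damping factor of 0.5 and the similarity described above.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]

def select_exemplars(points: List[Vector], radius: float) -> List[Vector]:
    """Greedy stand-in for AP: keep a point as a new exemplar when it is
    farther than `radius` from every exemplar chosen so far."""
    exemplars: List[Vector] = []
    for p in points:
        if all(math.dist(p, e) > radius for e in exemplars):
            exemplars.append(p)
    return exemplars

def hierarchical_prototypes(
    bags: List[List[Vector]],
    cluster_fn: Callable[[List[Vector]], List[Vector]],
) -> List[Vector]:
    """Stage 1: cluster each bag separately. Stage 2: cluster the union
    of the per-bag centers to obtain the global phenotypes."""
    stage1_centers: List[Vector] = []
    for instance_features in bags:                 # one bag per training WSI
        stage1_centers.extend(cluster_fn(instance_features))
    return cluster_fn(stage1_centers)              # L phenotypes

# Toy training set: two bags of 2-D instance features.
bags = [
    [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)],
    [(0.0, 0.1), (5.1, 5.0), (9.0, 0.0)],
]
phenotypes = hierarchical_prototypes(
    bags, lambda pts: select_exemplars(pts, radius=1.0)
)
```

Clustering each bag first keeps the second stage small, since it only sees the per-bag centers rather than every instance from every slide.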

Feature aggregation with cross-attention
Cross-attention is an attention mechanism within the transformer architecture (Vaswani et al. 2017) that integrates two distinct embedding sequences. To better capture tissue heterogeneity with phenotype embeddings, we adapt the cross-attention module for the feature aggregation $S(\cdot)$ in our MIL framework.
Specifically, for the input instance features $P_i \in \mathbb{R}^{n_i \times d_1}$ of a WSI and the phenotype embeddings $\{p_k\}_{k=1}^{L} \in \mathbb{R}^{L \times d_1}$, a weight-shared fully connected projection layer $f_p(\cdot)$ is first utilized to reduce dimensionality, obtaining $\tilde{P}_i$ and $\{\tilde{p}_k\}_{k=1}^{L}$, respectively. They are then sent to the cross-attention part for interaction. The cross-attention module takes Key and Query inputs, where the phenotype embeddings $\{\tilde{p}_k\}_{k=1}^{L}$ are treated as the Query and the instance features $\tilde{P}_i$ as the Key. Query and Key pass through the respective linear transformation layers $w_q(\cdot)$ and $w_k(\cdot)$ to obtain the matrices $Q$ and $K_i$. The product of the two matrices, $Q K_i^{\top}$, yields a new measurement matrix $M_i$, which is then normalized using the L2 norm $\|\cdot\|_2$; note that $K_i$ is transposed in the product. In the standard transformer architecture, this measurement matrix serves as the attention coefficients, which are used to compute a new representation by multiplying with the Key. In our approach, guided by the phenotype embeddings $\{p_k\}_{k=1}^{L}$, each column of the measurement matrix $\{M^i_j\}_{j=1}^{n_i}$ has been projected into a new feature space $\mathbb{R}^{L}$ corresponding to a specific set of embeddings $Q$. Consequently, we take this transformed matrix $M_i$ as the bag representation $s_i$ for the image $X_i$. This facilitates interaction and integration between the phenotype embeddings $\{p_k\}_{k=1}^{L}$ and the instance features $P_i$ from each WSI.
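The shape bookkeeping of this aggregation step can be made concrete with a small sketch. It is written under several assumptions that the text does not fix: the linear layers are represented as bias-free weight matrices, the normalization is taken to be division by the Frobenius (entry-wise L2) norm of the whole product, and all names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]

def measurement_matrix(P: Matrix, phenotypes: Matrix,
                       W_p: Matrix, W_q: Matrix, W_k: Matrix) -> Matrix:
    """Return the L x n matrix M relating L phenotypes to n instances."""
    P_red = matmul(P, W_p)               # shared projection f_p on instances
    ph_red = matmul(phenotypes, W_p)     # same weights applied to phenotypes
    Q = matmul(ph_red, W_q)              # Query from phenotypes, L x d
    K = matmul(P_red, W_k)               # Key from instances,   n x d
    M = matmul(Q, transpose(K))          # L x n
    # Assumed normalization: divide by the Frobenius norm of M.
    norm = math.sqrt(sum(v * v for row in M for v in row))
    return [[v / norm for v in row] for row in M] if norm > 0 else M

# Toy shapes: n = 3 instances, L = 2 phenotypes, d1 = 2, reduced dim = 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phenotypes = [[1.0, 0.0], [0.0, 2.0]]
M = measurement_matrix(P, phenotypes, identity, identity, identity)
```

Each column of `M` describes one instance by its affinity to the L phenotypes, so the bag representation has a fixed feature dimension L regardless of the embedding size.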

HER2 status classifier head and loss optimization
On the basis of the phenotype-based bag-level representation $s_i$, different classifier heads $R(\cdot)$ can be applied to the matrix $M_i$ for HER2 status prediction. We explore classifiers including mean-based, attention score-based, and transformer-based heads.
• The mean-based linear head averages the representation matrix $M_i$ along the axis of instances, resulting in a vector $\bar{M}_i \in \mathbb{R}^{L}$. This vector is passed into the MLP layer for predicting $\hat{Y}_i$. The MLP layer consists of a linear layer, a ReLU activation function, and a linear output layer followed by a softmax function.
• The attention score-based fusion classifier learns instance-level attention scores on the matrix $M_i$, computes a weighted sum over them (Ilse et al. 2018), and predicts $\hat{Y}_i$ by the MLP layer.
• The transformer classifier consists of a transformer encoder (Vaswani et al. 2017) with an optimized class token and the multi-head self-attention mechanism. It uses a fully connected layer to predict $\hat{Y}_i$ based on the class token.
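As an illustration of the simplest of these heads, the sketch below averages an L x n representation matrix over instances and applies a linear layer, ReLU, a linear output layer, and softmax. The weights, layer sizes, and class ordering are toy assumptions chosen only to make the example checkable by hand.

```python
import math
from typing import List

def mean_head(M: List[List[float]],
              W1: List[List[float]], b1: List[float],
              W2: List[List[float]], b2: List[float]) -> List[float]:
    """Average the L x n matrix over instances, then apply
    linear -> ReLU -> linear -> softmax."""
    m_bar = [sum(row) / len(row) for row in M]          # vector in R^L
    hidden = [max(0.0, sum(w * x for w, x in zip(w_row, m_bar)) + b)
              for w_row, b in zip(W1, b1)]
    logits = [sum(w * h for w, h in zip(w_row, hidden)) + b
              for w_row, b in zip(W2, b2)]
    shift = max(logits)                                 # numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: L = 2 phenotypes, n = 3 instances, 2 hidden units, 2 classes.
M = [[0.2, 0.4, 0.6], [0.1, 0.1, 0.1]]
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
probs = mean_head(M, W1, b1, W2, b2)   # class order is an assumption
```

Because the mean is taken over instances, the head has no instance-specific parameters and works for bags of any size.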
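We employ cross-entropy loss as the overall objective function $\mathcal{L}(\cdot)$ to optimize our PhiHER2 model. In its standard binary form, this reads

$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B}\left[Y_i \log \hat{Y}_i + (1 - Y_i)\log\big(1 - \hat{Y}_i\big)\right], \qquad \theta_S^{*}, \theta_R^{*} = \arg\min_{\theta_S,\, \theta_R} \mathcal{L},$$

where $B$ is the batch size, $Y_i$ denotes the slide-level label of HER2 status, and $\hat{Y}_i$ represents the predicted probability for this WSI. $\theta_S$ and $\theta_R$ are the parameters in $S(\cdot)$ and $R(\cdot)$, and $\arg\min(\cdot)$ aims to find the parameters that minimize the function $\mathcal{L}(\cdot)$.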
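For concreteness, a batch-averaged cross-entropy over slide-level labels can be computed as in the following sketch. It assumes the binary form of the loss with labels in {0, 1} and a single predicted positive-class probability per slide; the clipping constant `eps` is an implementation convenience introduced here, not something specified in the text.

```python
import math
from typing import Sequence

def batch_cross_entropy(y_true: Sequence[int], y_prob: Sequence[float],
                        eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over a batch of B slides."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# One HER2-positive and one HER2-negative slide.
loss = batch_cross_entropy([1, 0], [0.9, 0.2])
```

A prediction of 0.5 for every slide gives a loss of ln 2, which is a useful reference point when monitoring training.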

Dual instance sampling
In our work, we introduce a dual instance sampling strategy designed to function as a bag-level data augmentation technique in MIL.This strategy is composed of two components: random sampling and iterative instance mining.
Initially, for a given bag $X_i$ in the dataset, we randomly sample a subset of $R$ instance embeddings $\{p^i_j\}_{j=1}^{R} \subset \{p^i_j\}_{j=1}^{n_i}$ to be included in each training epoch. Subsequently, we employ a score-based instance selector to mine $S$ instance embeddings $\{p^i_j\}_{j=1}^{S} \subset \{p^i_j\}_{j=1}^{R}$ with high scores during the iterative training process. Here, $R$ and $S$ are pre-defined numbers. The score-based instance selector utilizes a linear layer to compute vector scores for iterative mining, and it is positioned prior to the weight-shared fully connected projection layer $f_p(\cdot)$ (defined in Section 2.5). The dual instance sampling strategy is engineered to be a plug-and-play component, ensuring that the simplicity of our PhiHER2 architecture is maintained.
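The two sampling steps can be sketched as below. This is an illustrative simplification: the learned linear scoring layer is replaced by a fixed `score_fn` passed in by the caller, and the function and parameter names are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]

def dual_instance_sampling(
    bag: List[Vector],
    R: int,
    S: int,
    score_fn: Callable[[Vector], float],
    rng: random.Random,
) -> List[Vector]:
    """Randomly sample up to R instances, then keep the S highest-scoring ones."""
    sampled = rng.sample(bag, min(R, len(bag)))             # random sampling
    ranked = sorted(sampled, key=score_fn, reverse=True)    # score-based selector
    return ranked[:min(S, len(ranked))]                     # instance mining

# Toy bag of 1-D embeddings; the "linear layer" is a fixed weight of 1.0 (assumed).
bag = [(float(v),) for v in range(10)]
selected = dual_instance_sampling(bag, R=6, S=3,
                                  score_fn=lambda p: 1.0 * p[0],
                                  rng=random.Random(0))
```

Drawing a fresh random subset in every epoch exposes the model to different views of the same bag, which is what lets the strategy act as bag-level augmentation.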

Data description
The HEROHE dataset comes from the HEROHE challenge, which aimed to predict HER2 status directly from H&E-stained pathological images (Conde-Sousa et al. 2022). It includes a training dataset and an independent test dataset.
The training dataset consists of 360 WSIs of invasive BC tissue samples, and the test dataset contains 150 WSIs. The corresponding slide-level ground truth labels (positive or negative) were derived from IHC and ISH tests according to clinical practice guidelines (Wolff et al. 2018). All slides in both training and test datasets were scanned at 20× magnification with 0.243 μm per pixel and originated from different patients. No tumor region annotations or IHC slides were provided.
The Yale HER2 cohort was also collected (Farahmand et al. 2022). It contains 191 invasive BC samples, comprising 93 HER2-positive and 98 HER2-negative slides. The H&E-stained WSIs generated at Yale School of Medicine were scanned by Aperio ScanScope at 20× magnification with 0.497 μm per pixel. They were annotated with ROIs associated with tumor areas on invasive carcinoma by a senior breast pathologist, and the ROI annotation files were included in the dataset. Note that these ROIs were not used in our training process. Instead, we used them for comparative analysis with slide-level heatmap visualization to interpret the results.

Experimental designs
Implementation details and evaluation criteria can be found in the supplementary materials (Supplementary Sections S1 and S2).

Comparison with state-of-the-art methods
To verify the effectiveness of our PhiHER2 model, we chose several state-of-the-art weakly supervised methods (ABMIL (Ilse et al. 2018), CLAM (Lu et al. 2021), PMIL (Yu et al. 2023), and TRANS (Wagner et al. 2023)) for comparison. Brief introductions of these methods are listed in Supplementary Section S3. We conducted a systematic evaluation of the performance of PhiHER2 against the comparative methods. The quantitative results on positive F1-score (posF1), weighted averaged Precision (wPRC), Recall (wREC), F1-score (wF1), and balanced accuracy (bACC) are shown in Table 1. The ROC and PR curves, along with the average AUC and AUPRC values, are illustrated in Fig. 2a-d. The comparison of inference time cost is reported in Supplementary Table S1, and the performance of our model across five repeated experiments is depicted in Supplementary Fig. S2.
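Two of the reported metrics can be sketched from a confusion matrix as follows. The exact evaluation protocol is given in the supplementary materials; the code below only assumes the standard textbook definitions of the positive-class F1-score and balanced accuracy for binary labels, and the toy labels are invented for illustration.

```python
from typing import Sequence

def confusion(y_true: Sequence[int], y_pred: Sequence[int]):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def positive_f1(y_true, y_pred) -> float:
    """F1-score of the positive class: 2TP / (2TP + FP + FN)."""
    tp, _, fp, fn = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def balanced_accuracy(y_true, y_pred) -> float:
    """Mean of sensitivity and specificity."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2

# Toy imbalanced example: 3 positive and 5 negative slides.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
pos_f1 = positive_f1(y_true, y_pred)
b_acc = balanced_accuracy(y_true, y_pred)
```

Both metrics are less flattering than plain accuracy on imbalanced data, which is why they are informative for a dataset with more negative than positive slides.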
The results demonstrate the superior performance of PhiHER2 in accurately identifying HER2 status from BC WSIs. It achieved the best AUC values of 0.795 and 0.890 on the HEROHE dataset and the Yale cohort, respectively (Fig. 2a-b). Moreover, the average AUPRC value of 0.881 (Fig. 2d) shows that our model achieved the best trade-off between recall and precision on the Yale cohort. Due to the strong class imbalance ratio of 1.5 in the HEROHE dataset, the models' AUPRCs were lower than their AUCs. Nevertheless, our proposed PhiHER2 still attained the highest AUPRC value of 0.665 (Fig. 2c). The baseline ABMIL and the CLAM method, enhanced with instance attention scores, yielded acceptable but unremarkable results for the specific molecular HER2 status prediction task on HEROHE and Yale WSIs (Table 1, Fig. 2a-d). The TRANS method underperformed the other comparative methods. We attribute this observation to the data volume, considering that the original study collected over 20 000 H&E slides (Wagner et al. 2023). Moreover, the TRANS model, functioning as a general transformer-based pipeline for end-to-end biomarker prediction, falls short in accounting for the heterogeneity of morphological patterns. The PMIL-Cosine and PMIL-Euclidean methods outperformed CLAM and ABMIL on both datasets. This highlights the advantages of a phenotype-informed architecture. We further categorized the comparative models into two groups (phenotype-guided versus non-phenotype-guided) and reported the evaluation performance (Supplementary Fig. S3). We found that the first group consistently provides better performance, supporting the benefit of phenotype guidance. It is noteworthy that PhiHER2 displayed lower variance than PMIL (Table 1). This indicates that our model is able to cope with data bias and excels in accuracy and robustness when predicting HER2 status from WSIs.
Given that the HEROHE dataset came from a competition (Conde-Sousa et al. 2022), we benchmarked our results against the top entries on the public leaderboard (Supplementary Table S2). Our method outperformed the first-place entry, achieving a 2.6% improvement (0.706 versus 0.68). In terms of the AUC metric, our method obtained a notable improvement of 8.5% (0.795 versus 0.71) over the first-place entry. For the Yale cohort, we compared with the results presented in Farahmand et al. (2022). Our model significantly outperformed their unannotated two-way classifier (AUC values: 0.890 versus 0.82). In particular, our model, operating without any tumor ROI annotation, matched or even exceeded the performance of their tumor-annotation-based models, with an AUC value of 0.890 compared to 0.89 for their annotated two-way classifier and 0.88 for the annotated three-way classifier. This implicitly suggests that our model can identify key tumor regions. We provide a detailed interpretability analysis of these findings in Section 3.7.
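The improvements quoted against the leaderboard are consistent with absolute differences expressed in percentage points rather than relative changes. The short check below reproduces them under that reading; the function name is illustrative.

```python
def point_gain(ours: float, reference: float) -> float:
    """Absolute difference between two scores, in percentage points."""
    return round((ours - reference) * 100, 1)

first_gain = point_gain(0.706, 0.68)   # first comparison with the top entry
auc_gain = point_gain(0.795, 0.71)     # AUC comparison with the top entry
```

Under a relative reading, the same AUC pair would correspond to a gain of roughly 12%, so the distinction matters when comparing with other reports.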

Qualitative evaluation
We investigated the distribution of predictions in the different HER2 status groups. The predicted probabilities for HER2-positive on the HEROHE test cases were examined (Fig. 2e). Our method exhibits the widest gap between the median values of the prediction probabilities across the different HER2 status groups. Its predictions align better with the true HER2 status, with negative cases more frequently predicted towards the lower end of the scale and positive cases more often predicted towards the upper end. These findings suggest that PhiHER2 discriminates most clearly between HER2-positive and HER2-negative cases, highlighting the effectiveness of our method.

Ablation study
To validate the effectiveness of our proposed hierarchical prototype clustering structure and the cross-attention module, we adopted four distinct experimental configurations for comparison.
(i) Cluster-PT: The model with a cross-attention module and phenotype embeddings derived from the hierarchical prototype clustering structure to provide guidance.
(ii) Rand-PT: Fixed random vectors served as phenotype embeddings in the cross-attention module for direct inference. Random values were sampled from a uniform distribution, with dimensions and quantity identical to those of the cluster phenotype embeddings.
(iii) Initial-PT: Random vectors initialized with a uniform distribution served as phenotype embeddings. These vectors could be iteratively optimized during the training process.
(iv) Non-PT: The model with a cross-attention module but no phenotype guidance. In this configuration, the cross-attention module reverts to a self-attention mechanism.
We also included the ABMIL method as a baseline for comparison.
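The four configurations differ only in where the phenotype embeddings come from and whether they are updated during training, which the following sketch makes explicit. The function name, the [0, 1) range of the uniform distribution, and the treatment of clustered embeddings as fixed are assumptions made for illustration.

```python
import random
from typing import List, Optional, Tuple

Matrix = List[List[float]]

def phenotype_embeddings(
    config: str,
    clustered: Matrix,
    rng: random.Random,
) -> Tuple[Optional[Matrix], bool]:
    """Return (embeddings, trainable) for an ablation configuration."""
    L, d = len(clustered), len(clustered[0])

    def uniform() -> Matrix:
        # Same number and dimension as the clustered phenotypes.
        return [[rng.uniform(0.0, 1.0) for _ in range(d)] for _ in range(L)]

    if config == "Cluster-PT":
        return clustered, False      # prototypes from clustering (assumed fixed)
    if config == "Rand-PT":
        return uniform(), False      # fixed random vectors
    if config == "Initial-PT":
        return uniform(), True       # random start, optimized during training
    if config == "Non-PT":
        return None, False           # no phenotypes: self-attention on instances
    raise ValueError(f"unknown configuration: {config}")

rng = random.Random(0)
clustered = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
emb, trainable = phenotype_embeddings("Initial-PT", clustered, rng)
```

Keeping the number and dimension of the random embeddings equal to those of the clustered ones isolates the effect of their content from the effect of model capacity.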
The architectures of the experimental configurations and the comparative models for the following subsections are further illustrated in Supplementary Fig. S4. The results for these models are reported in Fig. 3a and Supplementary Table S3.

Effectiveness of phenotype guidance
Cluster-PT consistently outperformed the baseline ABMIL and Non-PT on both the HEROHE set and the Yale cohort. Cluster-PT obtained an average AUC value of 0.893, marking a significant improvement of 6.7% over Non-PT and 7.3% over the baseline on the Yale cohort (Supplementary Table S3). A comparable enhancement in performance is also evident in the results on the HEROHE dataset. Taken together, these results reinforce that phenotype guidance in Cluster-PT is effective in exploring relations between representative embeddings and the instance features of WSIs, thereby aggregating distinguishing representations.

Effectiveness of hierarchical prototype clustering
The Cluster-PT approach achieved superior performance compared to both Rand-PT and Initial-PT, with respective AUC improvements of 10.1% and 1.0% on the HEROHE dataset (Supplementary Table S3). This suggests that phenotypes can be enhanced through the incorporation of our hierarchical prototype clustering. The tissue phenotypes corresponding to the prototype embeddings derived from the hierarchical prototype clustering module on both datasets are provided (Fig. 3b and e, Supplementary Fig. S5). We found that the clustering module captures typical phenotypes within WSIs. Expert pathologists characterized these regional tissue phenotypes as dense tumor cellularity, a discrete distribution of tumor cells, abundant stromal fibrosis, the presence of tumor-infiltrating lymphocytes, and white background.
We employed a pre-trained Hover-Net (Graham et al. 2019) for cell segmentation and classification on those regional tissue phenotypes. This enabled us to obtain the cellular spatial composition of five cell types, including neoplastic, inflammatory, connective, dead, and non-neoplastic epithelial cells (Fig. 3c and f, Supplementary Fig. S6). These phenotypes differ in their spatial cellular composition. For example, certain phenotype patches exhibit a high abundance of locally aggregated inflammatory cells and a higher percentage of tumor (neoplastic) cells, while others show loosely distributed stromal (connective) cells interconnected with tumor cells. These observations align with those made by expert pathologists. The representative phenotypes reveal distinct cellular spatial compositions and morphological patterns in pathological images. This may provide explainable insights into the spatial morphological heterogeneity of BC. We also conducted a further analysis of the prototype embeddings with model prediction scores to compare HER2-positive (denoted in purple) with HER2-negative (blue) cases. A distinct separation is evident between the two groups according to the 2D density plots (Fig. 3d and g). These observations indicate an association between pathological morphological patterns and molecular HER2 status.

Effectiveness of the cross-attention module
Initial-PT demonstrated a clear advantage over Rand-PT, with AUC values of 0.785 on the HEROHE dataset and 0.892 on the Yale cohort (Supplementary Table S3). However, both the Rand-PT and Initial-PT models were initialized randomly. This prompted us to investigate the degree to which prototype embeddings are affected by model optimization within the cross-attention module. We extracted the transformed embeddings of phenotypes at various linear layers (input, projection, and query) within the cross-attention module. These embeddings were denoted as Raw, Projection, and Query, respectively. Uniform manifold approximation and projection (UMAP) was applied to the embeddings for dimensionality reduction, allowing visualization of phenotypes in a two-dimensional space (Fig. 4a). We found that the prototype embeddings from the same layer tend to cluster together, whereas the representations from different layers are well separated. The arrows roughly mark the distances between Raw and Query embeddings. This implies that the iteratively optimized cross-attention module guided by prototypes can further facilitate the transformation of phenotypes into an alternative feature space. It explains why Initial-PT outperforms Rand-PT and validates the critical role of the cross-attention module in our PhiHER2 model.

Robustness of classifier head strategies
In this experiment, we compared the performance of our PhiHER2 model when coupled with different classifier heads, covering those based on mean operation, attention scores, and transformer strategies (Fig. 4b, Supplementary Table S4).
We also reported the results of the baseline ABMIL model. The results indicate that PhiHER2 remains relatively robust with respect to different classifier head strategies, and these models outperform the baseline ABMIL method across both the HEROHE dataset and the Yale cohort. This highlights that our phenotype-informed framework is insensitive to the choice of classifier head. Notably, the model equipped with the mean-based classifier head yielded the best performance. This suggests that the PhiHER2 model has better consistency among instance embeddings within a single WSI. We therefore extracted the WSI-level representation vectors before the classifier head and visualized them as embedding heatmaps (Fig. 4c, Supplementary Fig. S7). Our phenotype-informed aggregated features reveal a significant distinction between the HER2-positive and HER2-negative groups (Fig. 4c). This is in line with the observations from Fig. 3d and g. Additionally, PhiHER2 displays greater embedding consistency than PMIL and the non-phenotype-guided methods (ABMIL and CLAM). As a result, other methods show a greater dependency on the classifier head, whereas PhiHER2 can facilitate feature aggregation and enhance performance through a concise mean-based classifier head strategy.

Efficiency with dual instance sampling
To assess the efficiency of the dual instance sampling strategy for PhiHER2, we conducted experiments by varying the number of instances in random sampling and iterative mining. We established these sampling values by scaling the median value of the dataset accordingly (see Fig. 1c-d). For the HEROHE dataset, with a median value of 7248, we chose fixed sampling values of 10 000, 5000, 2000, 1000, 500, and 200 for comparison. For the Yale cohort, which has a smaller median value of 1376, we selected 2000, 1000, 500, 200, and 100. In addition, we performed experiments with no instance sampling to ensure a fair comparison.
The results are presented in Fig. 4d and e and Supplementary Fig. S8.We can observe the following empirical facts.The random instance sampling strategy brings performance improvements compared to the model with no instance sampling, and the latter also exhibits significant performance fluctuations.The model performance remains generally consistent when the number of instances in random sampling decreases.This shows that random instance sampling acting as a data augmentation in MIL enhances the training efficiency and performance of the model (Naik et al. 2020, Cao et al. 2023).The model with 500 instances of iterative mining achieved optimal performance, with a 2.3% improvement over the model with no iterative mining on the HEROHE dataset (Fig. 4e).Similar findings are observed from the results on the Yale cohort (Supplementary Fig. S8).These results underscore the contribution of our proposed dual instance sampling strategy.

WSIs interpretability analysis
For weakly supervised deep learning models in medical applications, interpretability is vital, and decisions should be visually explainable to clinicians. To this end, we implemented an overlapping heatmap visualization method to determine the importance of each region and to interpret the morphological patterns associated with the molecular status of HER2. Specifically, for a given WSI, the PhiHER2 model generated instance-level predictions by omitting the mean operation within its classifier head. The predicted class probabilities for each instance were normalized to the interval [0, 1]. These normalized instance scores were then mapped to their corresponding spatial locations within the WSI and visualized as a heatmap, which highlights the regions where the model exhibits the highest confidence.
Fine-grained heatmaps overlaid on WSIs from the Yale cohort and the HEROHE test set are visualized in Fig. 5a-d and Supplementary Fig. S9. The WSI heatmaps exhibit a high degree of consistency with the expert ROI annotations in HER2-positive and HER2-negative cases (Fig. 5c and d). Our trained weakly supervised PhiHER2 model is thus able to identify and weight ROIs related to HER2 status using slide-level labels only. We further examined the distribution of the attention heatmaps over distinct tissue ROIs (Fig. 5e and f). Most of the high-contribution areas are located in tumor tissue, and the representative patches with the highest probabilities mostly derive from tumor areas. This indicates which tissue patterns are most informative for the WSI-level HER2 status prediction. Such heatmaps have the potential to give clinicians deeper insight into the morphological patterns underlying the regional heterogeneity of HER2 expression.
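
The normalization and spatial mapping steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes min-max normalization, top-left patch coordinates in pixels, and a non-overlapping 256-pixel patch grid. The names `min_max_normalize` and `build_heatmap` are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def build_heatmap(coords, scores, patch_size=256):
    """Place normalized instance scores on a patch-resolution grid.

    coords: (x, y) top-left pixel positions of each patch.
    Cells with no patch (e.g. background) stay None.
    """
    norm = min_max_normalize(scores)
    cols = [x // patch_size for x, _ in coords]
    rows = [y // patch_size for _, y in coords]
    grid = [[None] * (max(cols) + 1) for _ in range(max(rows) + 1)]
    for r, c, v in zip(rows, cols, norm):
        grid[r][c] = v
    return grid


# Three tissue patches; the fourth grid cell has no patch.
heatmap = build_heatmap([(0, 0), (256, 0), (0, 256)], [0.2, 0.8, 0.5])
```

The resulting grid can then be upsampled and blended with the slide thumbnail using any plotting library to produce the overlay.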

Conclusion
In this work, we introduce PhiHER2, a phenotype-informed weakly supervised learning architecture designed for the prediction of HER2 status from BC WSIs. PhiHER2 leverages heterogeneous prototype embeddings to guide feature aggregation in the MIL framework. The proposed hierarchical prototype clustering module discovers typical phenotypes, which account for the tumor heterogeneity exhibited in WSIs.
Additionally, the cross-attention module enhances instance interactions and captures a feature space based on the representative phenotypes. The concept of phenotype guidance in the cross-attention mechanism proves to be flexible and robust across various classifier head strategies. The experiments demonstrate that the PhiHER2 model achieved a remarkable improvement in performance over existing methods on both the HEROHE dataset and the Yale cohort. Interpretative analysis of phenotype instances and WSI heatmaps offers valuable insights into the heterogeneity of tumor morphological patterns associated with the molecular status of HER2. Our model exhibits the potential to serve as a reliable framework for other studies, and its capability of handling more clustering algorithms and cross-dataset applications will be investigated in our future work.

Feature extraction, which serves as the transformation function T(·), is a fundamental step in MIL. A contrastive learning-based feature extraction model (Wang et al. 2023), pre-trained on around 15 million patches from over 32 000 WSIs, has recently emerged as a universal feature extractor within the computational pathology community. We employ it for feature extraction on the instances. This yields 2048-dimensional vectors from tiled 256 × 256 pixel patches. Extracting these low-dimensional features reduces computational complexity and facilitates MIL modeling of the instances within a slide.
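
As an illustration of the tiling that precedes feature extraction, the sketch below enumerates patch positions for a slide region. It assumes non-overlapping tiles and drops partial tiles at the right and bottom edges; neither choice is specified here, tissue filtering is omitted, and `tile_coordinates` is a hypothetical helper.

```python
def tile_coordinates(width, height, patch_size=256):
    """List top-left (x, y) positions of non-overlapping square patches.

    Assumption: partial tiles at the right and bottom edges are dropped.
    """
    return [
        (x, y)
        for y in range(0, height - patch_size + 1, patch_size)
        for x in range(0, width - patch_size + 1, patch_size)
    ]


# A 600 x 512 pixel region yields a 2 x 2 grid of 256-pixel patches.
coords = tile_coordinates(600, 512)
```

Each patch at these positions would then be passed through the pre-trained extractor to obtain one 2048-dimensional instance vector.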

Figure 1. Workflow overview with (a) WSIs processing, (b) the PhiHER2 architecture, and the data distributions of the number of patches from WSIs on (c) the HEROHE dataset and (d) the Yale cohort.

Figure 2. Evaluation performance and comparative analysis for HER2 status prediction. (a, b) ROC curves for PhiHER2 and comparative models tested on the HEROHE dataset and the Yale cohort. (c, d) PR curves for PhiHER2 and comparative models evaluated on both datasets. (e) The probability distribution of predictions for different HER2 status groups from the HEROHE dataset.

Figure 3. Visualization of (a) ablation results for cluster phenotype guidance with the cross-attention module evaluated on HEROHE and Yale dataset, (b, e) representative phenotypes derived from the hierarchical prototype clustering module on the HEROHE dataset, (c, f) cell segmentation and classification results overlaid on phenotype patches, where different colors mark different cell types, and (d, g) 2D density plots showing the distribution of phenotype embeddings and model prediction scores for different HER2 status groups, where the x-axis denotes the model prediction scores and the y-axis represents the specific feature values.

Figure 4. Visualization of (a) phenotype embeddings for 5-time models on the HEROHE dataset, (b) evaluation performance with different classifier head strategies on the HEROHE dataset and the Yale cohort, (c) WSI-level representation heatmaps for PhiHER2 and PMIL methods, where each row represents a specific dimension in the WSI-level feature vector, each column represents a sample with negative (blue) or positive (orange) HER2 status, and the color in the heatmap indicates the value intensity of the feature vectors. (d, e) Illustration of data efficiency for dual instance sampling regarding the number of instances in (d) random sampling and (e) iterative mining on the HEROHE dataset.

Figure 5. WSI-level overlapping heatmap visualization for model interpretability. (a, b) Two pathological images with manual ROI annotations derived from the Yale cohort. (c, d) The corresponding attention heatmaps overlaid on raw WSIs. Large values (red) indicate a high contribution to the model's prediction, and small values (blue) a low contribution. (e, f) Selected ROIs for zooming in to observe detailed tissue regions. Three tissue patches with the highest probabilities are also presented.

Table 1. Performance comparison of PhiHER2 and comparative methods on the HEROHE dataset and the Yale cohort. The results were averaged across five repeated experiments. Mean and standard deviation are reported, and the best results are indicated in bold.